-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50639][SQL] Improve warning logging in CacheManager #49276
base: master
Are you sure you want to change the base?
Conversation
@hvanhovell please review |
@gengliangwang can you take a look |
val shouldRemove: LogicalPlan => Boolean = | ||
if (cascade) { | ||
_.exists(isMatchedPlan) | ||
} else { | ||
isMatchedPlan | ||
} | ||
val plansToUncache = cachedData.filter(cd => shouldRemove(cd.plan)) | ||
var plansToUncache: IndexedSeq[CachedData] = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov why do we need the code change here if this PR is to improve the logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang logging relies on the plansToUncache.nonEmpty
added on line 288 to correctly log warnings, so the change is necessary to prevent race condition (cacheData
should be accessed under synchronized
).
@gengliangwang please check my reply |
Change in #45990 provides troubleshooting options for the |
@gengliangwang, @anchovYu please check my reply. |
@gengliangwang My understanding is that log level was set to TRACE intentionally and it should not be enabled by default. Please see comment on #45990
Note that warnings on line 129 and 145 coexist with changes from #45990 and provide early problem notification. This PR originates from a real issue where I spent large amount of time first isolating memory leak to the |
Can you provide more details. I think they are similar. Please check the method calls of |
@gengliangwang Please check lines 129 and lines 145. Those are pre-existing warnings (existed prior to #45990 and were not removed as part of #45990) and they use
They are related, but serve different purpose. Entries that use |
@gengliangwang Please check my reply. To clarify why I think #45990 and #49276 are related but do not overlap: changes in #45990 log (trace by default) messages when the cache is modified (an item is added or removed from the cache), while changes in #49276 log warning messages when the cache is expected to be modified by a user, but is not modified (an item is not added or is not removed) along with details about Dataset that is used in the call to |
@gengliangwang Please check my reply. |
@@ -126,7 +126,9 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper { | |||
if (storageLevel == StorageLevel.NONE) { | |||
// Do nothing for StorageLevel.NONE since it will not actually cache any data. | |||
} else if (lookupCachedDataInternal(normalizedPlan).nonEmpty) { | |||
logWarning("Asked to cache already cached data.") | |||
logWarning(log"An attempt was made to cache data even though the data had already been " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov so in the method call of lookupCachedDataInternal
, it will output
CacheManager.logCacheOperation(log"Dataframe cache hit for input plan:" +
log"\n${MDC(QUERY_PLAN, plan)} matched with cache entry:" +
log"${MDC(DATAFRAME_CACHE_ENTRY, result.get)}")
while in the change here it suggests users to uncache instead of cache is hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log from #45990 is more accurate, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is not. The cache hit is irrelevant in this case as the caller(user) does not expect entry to be present in the cache and the user explicitly calls persist()
while cache lookup is an internal operation not initiated by the caller. The warning message in #49276 means that user either does not need to call persist()
(and should remove it) or there is missing/wrong call to unpersist()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang Please see my response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang Please see my response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov can you explain why we need to uncache data or clear cache? If the plan is already cached, then we don't need to cache it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang sounds that we are going in a circle here. The warning message is logged when there is a call to persist()
when DataSet is already present in the cache, so it is an indication of a bug in the code that calls persist()
. Either there is a missing call to unpersist()
on the Dataset (there may be a call to unpersist()
on another Dataset that was not cached as in the sample I provided in the PR comment) or the call is not necessary at all. The warning message is a way for the user to be notified that there is a problem (bug) in the code and it is up to the user to see how to fix it. If you have better suggestion on the warning message wording, please post it here. As far as I can see, the proposed warning message indicates that
- the data had already being cached, so the call may not be necessary
- there may be a missing call to
unpersist()
or clear cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please un-cache data or clear cache first.\nLogical plan:\n"
Asking users to call unpersist
is unnecessary and super confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I disagree that it is super confusing and explained when call to
unpersist()
may be necessary. - To move this PR forward I updated warning message and removed reference to un-cache data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang This code change is to provide added-value for warning log as developers can easily identify which query_plan was already persisted.
-
Before the change: The warning log only showed
Asked to cache already cached data
. Developers can not identify which query_plan was already cached from the warning message. For large project, it means the warning does not add value to the user as there might be too many dataframe in the project. -
After the change: The warning log showed which query_plan was already cached. Then developers can easily check their code to identify the unnecessary cache/persist for specific dataframe.
uncacheByCondition(spark, _.sameResult(plan), cascade, blocking) | ||
if (!uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)) { | ||
logWarning(log"Data has not been previously cached or it was removed from the " + | ||
log"cache already.\nLogical plan:\n${MDC(QUERY_PLAN, plan)}") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang This log is to warn that developers are trying to unpersist a query_plan which has not been previously cached and show the related query plan details.
For example, this sample pyspark code is trying to unpersist a redefined a dataframe. This leavs the query plan of the original cached dataframe in CacheManager. If this happens in for loop or spark structured streaming foreachbatch, the driver memory will constantly increase and lead to memory issue.
df = spark.createDataFrame(data, ["name", "age", "city"])
df.persist()
df.show()
df = df.withColumn("NAME", upper(col("name")))
df.show()
df.unpersist()
The proposed change here is to help developers easily identify they are trying to unpersist a query_plan which has not been previously cached. Then developers can review their code to confirm whether they are unpersisting a wrong dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vrozov @gengliangwang I added comments for two warning log changes and illustrated their added value for developers. Please let me know if any changes needed to move this PR forward.
@gengliangwang Please review |
1 similar comment
@gengliangwang Please review |
@@ -126,7 +126,9 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper { | |||
if (storageLevel == StorageLevel.NONE) { | |||
// Do nothing for StorageLevel.NONE since it will not actually cache any data. | |||
} else if (lookupCachedDataInternal(normalizedPlan).nonEmpty) { | |||
logWarning("Asked to cache already cached data.") | |||
logWarning(log"An attempt was made to cache data even though the data had already been " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this is very similar to the changes in https://github.com/apache/spark/pull/45990/files#diff-88635a13b65f19dcc80b865d903b498b8328607f96c088402a8ebdbb857eedf9R303. I don't think we should have two duplicated logs with different log level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gengliangwang The log entry is not added as part of this PR, so I don't follow your concern.
@gengliangwang Please see my response to your comment |
@gengliangwang ^^^ |
@hvanhovell @dongjoon-hyun Please take a look. |
What changes were proposed in this pull request?
The change improves warning logging in the CacheManager by:
Why are the changes needed?
The change helps to identify incorrect calls to
Dataset.persist()
andDataset.unpersist()
as inDoes this PR introduce any user-facing change?
Users may see warning messages like:
and
How was this patch tested?
The change modifies warning log messages.
Was this patch authored or co-authored using generative AI tooling?
No.